Description

Background and Context

AllLife Bank is a US bank with a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and, in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

Objective

  1. To predict whether a liability customer will buy a personal loan or not.

  2. To identify which variables are most significant.

  3. To identify which segment of customers should be targeted more.

Data Description

Data Dictionary

Let's start coding!

Importing necessary libraries

Observations

Observations

Data Preprocessing

Processing columns

Create Dummy Variables

Values like 'Undergrad' cannot be read directly into an equation. Using numeric substitutes like 1 for Undergrad, 2 for Graduate, and 3 for Professional would end up implying that Graduate degrees fall exactly halfway between Undergrad and Professional degrees! We don't want to impose such a baseless assumption!

So we create three simple true/false columns with titles equivalent to "Is this degree Undergrad level?", "Is this degree Graduate level?", and "Is this degree Professional level?". These will be used as independent variables without imposing any kind of ordering on the three education levels.
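This one-indicator-per-level idea can be sketched with pandas' `get_dummies`; the column name `Education` and its string labels are assumed here for illustration.

```python
import pandas as pd

# Hypothetical sample; the real dataset's Education encoding may differ.
df = pd.DataFrame(
    {"Education": ["Undergrad", "Graduate", "Professional", "Undergrad"]}
)

# One true/false indicator column per education level, with no
# implied ordering between the levels.
dummies = pd.get_dummies(df["Education"], prefix="Education")
print(dummies.columns.tolist())
```

Each row now has exactly one of the three indicators set, so the model treats the levels as distinct categories rather than points on a numeric scale.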

Observation:

Observation:

Feature Engineering

Observation:

Exploratory Data Analysis

Let's check the statistical summary of the numerical variables.
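A minimal sketch of such a summary, assuming the DataFrame is named `df` (the columns below are stand-ins for the real data):

```python
import pandas as pd

# Hypothetical stand-in for the loan dataset.
df = pd.DataFrame({"Age": [25, 45, 39, 51], "Income": [49, 100, 81, 180]})

# describe() reports count, mean, std, min, quartiles, and max
# for every numerical column; transposing puts one variable per row.
summary = df.describe().T
print(summary[["mean", "min", "max"]])
```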

Univariate Analysis

Histogram

Observations

Age

Observations

Barplot

Observations

Bivariate Distributions

Let's check the correlation between numerical variables.
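A correlation check can be sketched as below; the columns are hypothetical stand-ins, and in the notebook the matrix would typically be rendered as a heatmap (e.g. with `seaborn.heatmap`).

```python
import pandas as pd

# Hypothetical sample with an Age/Experience relationship similar
# to what bank data usually shows.
df = pd.DataFrame({
    "Age": [25, 45, 39, 51, 30],
    "Experience": [1, 19, 15, 25, 5],
    "Income": [49, 100, 81, 180, 45],
})

# Pairwise Pearson correlations between numerical variables.
corr = df.corr()
print(corr.round(2))
```

Strongly correlated pairs (such as Age and Experience here) carry largely redundant information, which often motivates dropping or transforming one of them later.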

Observations

Observations

Zooming into these plots gives us important information.

Let's check the variation in Income with Age.

Income vs Age
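A minimal sketch of such a scatter plot, with hypothetical values standing in for the real columns:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample; column names assumed from the case study.
df = pd.DataFrame({"Age": [25, 45, 39, 51, 30],
                   "Income": [49, 100, 81, 180, 45]})

fig, ax = plt.subplots()
ax.scatter(df["Age"], df["Income"])
ax.set_xlabel("Age")
ax.set_ylabel("Income")
ax.set_title("Income vs Age")
```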

Observations:

Income_log vs Age

Observation:

Income vs Experience

Observation:

Dropping Income Column:
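Given the Income_log vs Age comparison above, the raw column is presumably replaced by its log transform (an assumed rationale: log-transforming reduces right skew). A sketch with a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the Income column.
df = pd.DataFrame({"Income": [49, 100, 81, 180]})

# Log-transform to compress the long right tail, then drop the
# raw column so only one version of the feature remains.
df["Income_log"] = np.log(df["Income"])
df = df.drop(columns=["Income"])
print(df.columns.tolist())
```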

Data Processing Continued

Splitting the data

Model Building:

Logistic Regression

We have split the data into train and test sets so that we can build the model on the train data and evaluate it on the test data.

We will build a Logistic Regression model using the train data and then check its performance.
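The split-then-fit workflow described above can be sketched as follows; the features and target here are synthetic stand-ins, not the actual bank data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and Personal_Loan target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Stratified split so both sets keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

model = LogisticRegression()
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(round(train_acc, 2), round(test_acc, 2))
```

Comparing train and test scores like this is what the performance checks in the following sections rely on.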

Observations:

Observation from the confusion matrix:

Build Decision Tree Model
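A minimal sketch of fitting a decision tree, again on synthetic stand-in data; recall is used as the metric since the case study cares most about catching likely loan buyers.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# An unconstrained tree typically fits the training data perfectly,
# which is exactly why train vs test performance is compared next.
tree = DecisionTreeClassifier(random_state=1)
tree.fit(X_train, y_train)

train_recall = recall_score(y_train, tree.predict(X_train))
test_recall = recall_score(y_test, tree.predict(X_test))
```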

Checking model performance on training set

Checking model performance on test set

Observation:

Visualizing the Decision Tree

Using GridSearch for Hyperparameter tuning of our tree model
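The tuning step can be sketched with scikit-learn's `GridSearchCV`; the parameter grid and `scoring="recall"` below are assumptions for illustration, and the data is a synthetic stand-in.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

# Hypothetical grid; the notebook may tune different parameters.
param_grid = {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 5, 10]}

# 5-fold cross-validated search, optimizing recall.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="recall",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```

`grid.best_estimator_` is then evaluated on the train and test sets just like the untuned tree.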

Checking performance on training set

Checking model performance on test set

Cost Complexity Pruning
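A minimal sketch of the pruning mechanics on synthetic data: scikit-learn exposes the candidate `ccp_alpha` values via `cost_complexity_pruning_path`, and one tree is fit per alpha.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = (X[:, 0] > 0).astype(int)

# Effective alphas along the pruning path of the unpruned tree.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
ccp_alphas = path.ccp_alphas

# One tree per alpha; larger alpha -> heavier pruning -> fewer nodes.
trees = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X, y)
    for a in ccp_alphas
]
node_counts = [t.tree_.node_count for t in trees]
```

Plotting recall against these alphas for the train and test sets (as in the next subsection) is how the final alpha is chosen.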

Recall vs alpha for training and testing sets

Checking model performance on training set

Checking model performance on test set

Visualizing the Decision Tree

Comparing all the decision tree models

Business Insights